MBD @AP2022

CAPSTONE PROJECT

Group A

Capstone - Churn Definition & Prediction for ClientCo®¶

0. Context¶

The end of this notebook is to define Churn in ClientCo® clients based on the predictions obtained from a combinaitons of analysis and models:

Trend Analysis | RFM Analysis| CLV | K-Means| Isolation Forest

After the analysis a Classification Machine Learning model will be implemented, to predict expected Churn.

+ Info: https://blackboard.ie.edu/ultra/courses/_46938_1/outline/file/_683290_1

1. Framing the Machine Learning Problem¶

Type of ML System¶

  • Supervised Learning: we will create our target's labels in preprocessing stages, and then will use them to predict new ones.[Churn]
  • Batch: the data provided comes in .csv format, converted to a more efficient .parquet.
  • Model based: the intended approach is to generalize from the training set, with a ML model.

Kind of ML Problem¶

  • Binary Classification: our final goal would be to predict if either our client will churn or not.

Performance Metric¶


Summary¶

  • Goal: predict Churn
  • Type of ML System: Supervised
  • Type of ML Problem: Classification
  • Performance Metric: Accuracy

2. Obtaining the Data¶

Let's start off with the data.

First of all, we have a single dataset:

  • data.parquet

Our data is not split for us. We may use an 80% of the complete dataset to for the train and a posterior K-Fold CV to tune the hyperparameters. The spare 20% will be used in the test, to measure how our model generalizes. This is the strategy we will apply.

We will import the self-made mlTools package, in order to import, split, and explore our datset with ease:

In [1]:
SEED = 42
TEST_SIZE = 0.2
SAMPLE_SIZE = 1000000
In [2]:
%matplotlib inline

# General

from mlTools import dataLoader, dataSplitter, dataExplorer, dataProcessor
import numpy as np
import pandas as pd

# Plotting

import matplotlib.pyplot as plt
import plotly.express as px
import phik
import seaborn as sns
from pandas.plotting import scatter_matrix

# RFM
from rfm import RFM

# CLV

from lifetimes.utils import summary_data_from_transaction_data
from lifetimes import BetaGeoFitter
from lifetimes.plotting import plot_frequency_recency_matrix
from lifetimes.plotting import plot_probability_alive_matrix 
from lifetimes.plotting import plot_period_transactions
from lifetimes.plotting import plot_cumulative_transactions 
from lifetimes.plotting import plot_incremental_transactions 

# Clustering
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

# Feature Extraction
from sklearn.decomposition import PCA

# Anomaly 
from sklearn.ensemble import IsolationForest

# Preprocessing
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Training & Validation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Deployment
import pickle
In [3]:
loaderObj = dataLoader()
data = loaderObj.batch_loader("./data/data.parquet",False)

Now that the whole data is imported, let's perform the fixed split we mentioned before:

In [4]:
splitterObj = dataSplitter(data)
train,test = splitterObj.train_splitter("sales_net", TEST_SIZE, SEED, False)
In [5]:
train = train.sample(SAMPLE_SIZE)
test = test.sample(int(SAMPLE_SIZE*TEST_SIZE))

Our data is now imported and split as train and test, both as a pandas DataFrame.

Now, we will perform an Exploratory Data Analysis on the train.

NOTE: The split done is random, and not stratified as the distribution of sales is not unbalanced. However, in other analysis in this same dataset (a.k.a. churn), an stratified split and sampling should be the way to go.

3. EDA¶

In this step of the Pipeline we will grasp for the first time the substance of our data: we will learn what features are present in our trainset, check their types and how their values are distributed.

This step is fundamental to set a strategy around the next step, Data Processing.

3.1. Overview¶

We will start with the basics. As mentioned earlier, we will make use of our own mlToolslibrary, which uses pandas, numpy, matplotliband others as backend.

In this basic overview, the output will be quivalent to using pandas.head(), pandas.info()and pandas.describe() methods.

In [6]:
categorical = ["order_channel","branch_id"]
numerical = ["product_id", "client_id", "sales_net", "quantity"]

explorerObj = dataExplorer(train, categorical, numerical)
explorerObj.basic_explorer("sales_net")
date_order date_invoice product_id client_id sales_net quantity order_channel branch_id
39721857 2018-12-26 2018-12-26 1366829 1905984 1.619200 11 at the store 3439
5891554 2017-12-01 2017-12-01 2512504 861335 0.000000 3 at the store 3417
44786026 2019-02-27 2019-02-27 2975027 69381 182.528000 31 online 20
16295413 2018-03-26 2018-03-27 1895362 1583937 0.414000 3 online 1888
31304247 2018-09-25 2018-09-25 1180925 1600012 63.970728 401 by phone 7101
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 39721857 to 50519451
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   date_order     1000000 non-null  object 
 1   date_invoice   1000000 non-null  object 
 2   product_id     1000000 non-null  int64  
 3   client_id      1000000 non-null  int64  
 4   sales_net      1000000 non-null  float64
 5   quantity       1000000 non-null  int64  
 6   order_channel  1000000 non-null  object 
 7   branch_id      1000000 non-null  int64  
dtypes: float64(1), int64(4), object(3)
memory usage: 68.7+ MB
None
product_id client_id sales_net quantity branch_id
count 1.000000e+06 1.000000e+06 1.000000e+06 1000000.000000 1000000.000000
mean 1.633697e+06 1.139933e+06 1.486738e+02 89.665476 5467.184475
std 9.191704e+05 6.553165e+05 1.895873e+03 818.111417 3176.086567
min 1.500000e+01 6.000000e+00 -1.311241e+05 3.000000 20.000000
25% 8.517370e+05 5.650640e+05 1.407600e+01 3.000000 2907.000000
50% 1.625584e+06 1.152568e+06 4.425195e+01 5.000000 5226.000000
75% 2.436220e+06 1.707522e+06 1.310755e+02 21.000000 8453.000000
max 3.238739e+06 2.274517e+06 1.740456e+06 192001.000000 11057.000000
 0.000000      38340
 2.290800       4809
 4.609200       2582
 1.159200       1735
 0.414000       1525
               ...  
 311.932133        1
 670.128000        1
 684.065387        1
 274.587156        1
-58.627000         1
Name: sales_net, Length: 243428, dtype: int64

Firstly, we see we only have 8 features. Some are Categorical, others Numerical. We must cast both dates, and from it we might obtain interesting new features.

In the end, having 8 features is not sufficient to create a robust Machine Learning model, so some Feature Engineering might be necessary to output a useful model.

From this basic analysis we can also see some of the basic statistics on how the data is distributed.

The method returns some of the statistics behind a Boxplot, such as the quartiles. It also gives some information around the mean and standard deviation of each feature.

From this we can sense the range of values in which the different features range. We can also check the possibility of Outliers.

One possible outlier may reside in the features sales_net and quantity, as their maximum values fall by far out of the IQR. In the following steps, we will further analyze this possibility.

Finally, regarding the count of rows we have per attribute, as also the type of each, using the info() method comes quite handy.


Summary

  • Features: ['date_order', 'date_invoice', 'product_id', 'client_id', 'sales_net', 'quantity', 'order_channel', 'branch_id']

    • Numerical:['sales_net','quantity']
    • Categorical:['product_id', 'client_id', 'order_channel', 'branch_id']
    • Unknown: ['date_order', 'date_invoice']
  • Feature Casting: ['date_order', 'date_invoice']

3.2. Casting¶

From this analysis we also point out that some features are incorrectly formatted, such is the case of date_order and date_invoice.

Even if this step may be considered as part of a processing stage, casting both features is fundamental to perform a more complete exploration. All this changes will also optimize the size-in-memory of the dataset.

Asides from that, some other features should be casted to a more efficient formatting. We will perform all these casts in the following method:

In [7]:
def data_optimize(df):
    df.product_id= df.product_id.astype(np.int32)
    df.client_id= df.client_id.astype(np.int32)
    df.quantity= df.quantity.astype(np.int32)
    df.branch_id= df.branch_id.astype(np.int16)
    df.sales_net=df.sales_net.astype(np.float32)
    df['date_order'] =  pd.to_datetime(df['date_order'], format='%Y-%m-%d')
    df['date_invoice'] =  pd.to_datetime(df['date_invoice'], format='%Y-%m-%d')
    df.order_channel= df.order_channel.astype('category')
    return df

train=data_optimize(train)

Let's see the effect of those changes:

Nice! That's more than a 40% decrease in memory usage! This will make the posterior analysis way more efficient.

From this method, we confirm the types we analyzed before.


Summary

  • Casting:

    • ['date_order', 'date_invoice']-> datetime64
    • ['product_id', 'client_id', 'sales_net', 'quantity', 'branch_id']-> optimized
  • Feature Engineering: ['data_order_year','data_order_month', 'data_order_dayofmonth', 'data_order_dayofweek']


3.3. Erroneous Values¶

But wait a second...

60+ millions of rows?

That's a massive dataset!

We may need to prepare ourselves to perform some sampling in the next steps of the EDA, or find out more efficient ways to import such a dataset.

The dataset is so huge info() doesn't return the whole information in regards to missing values.

Let's explore them with isnull():

In [8]:
print(train.isnull().sum())
date_order       0
date_invoice     0
product_id       0
client_id        0
sales_net        0
quantity         0
order_channel    0
branch_id        0
dtype: int64

Fortunately, there is only a fully NaN value, located in the date_invoice. This might mean lots of things:

  • Unvalid billing information
  • Blocked payment by the bank institution
  • Others

However, we won't bother much. We can safely get rid of this single missing value, as it's not relevant enough.

Let's analyze our target feature, an see if we point any anomalies:

In [9]:
display(explorerObj.outlier_explorer('sales_net'))
None
None

Indeed, there are some anomalies!

There seems to be negative values, which may seem impossible. However, this may be the product of a refund, or incorrect invoice.

Our future strategy will be to get rid of them, as considered erroneous. If our datasethad gave us the chance to track down a payment and refund, a more exhaustive analysis would be done.


Summary

  • Missing Values: date_invoice -> Drop

  • Negative Values: sales_net-> Drop


3.4. Encoding¶

We can also take a look at the categorical features, to check the different values we have.

This step is significant to plan out an strategy to later Encoding.

In [10]:
print(train["order_channel"].value_counts(), train["branch_id"].value_counts())
at the store                       507997
by phone                           401129
online                              89498
other                                 912
during the visit of a sales rep       464
Name: order_channel, dtype: int64 3318    10238
1888     9010
4080     8381
5489     7621
6702     7146
        ...  
807         2
2199        1
524         1
6468        1
7076        1
Name: branch_id, Length: 551, dtype: int64

We can easily check that One Hot Encoding should be the way to go for order_channel, as there increment of features won't be too heavy.

This is not the case for [branch_id, product_id, client_id]. We might be tempted to implement Target Encoding. We will dismiss this possibility for now, as those features won't be useful for the model.


Summary

  • Encoding : order_channel-> One Hot Encoding

3.6. Correlations¶

Speaking of correlations, let's have a look at them.

As at the moment we have both numericaland categoricalfeatures, we can analyize the correlations (both linear an non linear) with the phik_matrix()method.

In [11]:
corr_matrix = train.phik_matrix()
interval columns not set, guessing: ['product_id', 'client_id', 'sales_net', 'quantity', 'branch_id']
In [12]:
sns.set(font_scale=2)
fig, ax = plt.subplots(figsize=(50,50))         
sns.heatmap(corr_matrix, annot=True, linewidths=.5, ax=ax)
Out[12]:
<AxesSubplot:>

We shouldn't fool ourselves: the correlations don't look promising to establish an strategy. Small to no correlation is observed in between the default features (if we dismiss the obvious duplicated dates).

As an Supervised Learning system, we can compare the correlations to a target, in this case, to sales_net. In this aspect, and as obvious as it seems, the quantity shows a strong correlation to the sales_net.

We can, at least, see if there is any interesting interaction in between the numericalfeatures we described previously:

In [13]:
numerical = ["sales_net", "quantity"]
scatter_matrix(train[numerical], figsize = (10, 10));

From the scatter_matrix()we can not only see the distribution of each feature, but also the interactions in between them.

3.7. Seasonality¶

Another interesting element to study is if there is any observable seasonality on the trends of our transactions. Let's plot how the clients behaved during the years 2017-2019:

In [14]:
train_seasonality = train.reset_index().set_index('date_order')['sales_net'].resample(rule='MS').sum().to_frame()
In [15]:
plt.figure(figsize=(30,20)) 
train_seasonality.plot();
<Figure size 2160x1440 with 0 Axes>

From the plot we can spot some patterns in October, where after some growth, it slowly decades until February. From February - March there seems to be a consistent growth, followed by a drop in April-May.

Then, another growth in sales is experienced during Jun-Aug, followed by a drop after it, and beginning again the cycle.

3.8 Scaling¶

Being an Supervised Learningsystem, we should expect the usage of classic Regression algorithms.

Even though it's not mandatory on selected models, it is recommended scaling our data. The reason for this is to prevent the model from compensating the traineable parameters due to a significant scale difference.

There are no requirements in regards to anomalous data, and therefore, a simple StandardScaler() should be enough.


3.9 Extra: Ensemble Analysis¶

The concepts analyzed in this EDA stage, up until now, are very common. However, even if our goal is to construct a final Supervised Learning Binary Classifier, we must recall we still do not have our labels: we must create them.

In order to create them, we have to analyze the probability of Churn from various perspectives. In the steps that follow this EDA, we will discuss different perspectives that may guide us to our desired labels.

The analysis we will perfom include:

  • Trend Analysis: analyze whether a customer is likely to churn if their delay in the last orders is higher than his own average.

  • RFM: a classic technique in Marketing Segmentation, RFM still may be useful to cluster customers depending on their Recency, Frequency and Monetary scores.

  • K-Means: another classic, in this case as an Unsupervised Learning model to cluster data into groups by similarity.

  • Isolation Forest: a particular model used to analyze outliers, in order to track down those Churn instances that behave in a unconventional way.

The predictions from all these analysis will be merged into a single column, Churn, in a similar fashion to a VotingClassifier ensemble model, by manually setting Churn to 0 or 1 depending on the majority vote.

Here is a diagram showing the concept:

Let's analyze one by one, a preview on how they would work with our preprocessed sampled_train, to further implement it as a Feature Engineering step.

NOTE: We will heavily sample our train, as the size of the data limits the ability to compute. The chosen sampling method is random, and not stratified. This should be enough to get a sense of what features might be explanatory of our targets' behaviour.

Trend Analysis¶

One of the most primitive ways to analyze whether a customer is likely to churn or not, is to analyze the delay between orders: if the delay from the last kperiods is superior to that customer's average, we may induce the customer is going to churn.

Let's see how we could implement this:

  • sort_values
  • Create prev_order feature
  • Compute delay between date_order and prevorder`
  • Compute avg_delay by averaging the delays from groupby.
  • If delay>avg_delay in the last k=2 orders, provided they are not done very recently, then Churn.
In [16]:
sampled_train = pd.read_csv('./processed/sample_example.csv', index_col=[0])
In [17]:
sampled_train['date_order'] = pd.to_datetime(sampled_train['date_order'], format='%Y/%m/%d')
In [18]:
sampled_train = sampled_train.sort_values(['client_id', 'date_order'])
sampled_train['prev_order'] = sampled_train['date_order'].shift(2)
sampled_train['delay'] = sampled_train['date_order'] - sampled_train['prev_order']
sampled_train['avg_delay'] = sampled_train.groupby('client_id')['delay'].transform('mean')
sampled_train['trend_pred'] = sampled_train['delay'].gt(sampled_train['avg_delay']).astype(int)
sampled_train['trend_pred'] = sampled_train.groupby('client_id')['trend_pred'].transform(lambda x: 1 if (x == 1).any() else 0)
num_orders = sampled_train.groupby('client_id').size()
last_order_date = sampled_train.groupby('client_id')['date_order'].last()
if ((num_orders < 2) & last_order_date.dt.year.eq(2019) & last_order_date.dt.quarter.eq(4)).any():
    sampled_train['trend_pred']=0
sampled_train = sampled_train.drop(['prev_order', 'delay', 'avg_delay'], axis = 1)
sampled_train['trend_pred'] = sampled_train['trend_pred'].astype(np.uint8)
In [19]:
sampled_train['trend_pred'].value_counts()
Out[19]:
0    5699
1    3560
Name: trend_pred, dtype: int64

RFM Analysis¶

One of our main pillars for the Capstone is launching an RFM Model.

But what is it? And how this analysis may be useful for our models?

Let's find that out!

Concept

RFM is a behavioural segmentation method to analyze customers, from transactional data.

The following is what RFM stands for:

  • Recency shows you how recently a customer bought from your store.

  • Frequency reflects how often a customer purchases from your brand.

  • Monetary value represents how much a customer usually spends with your store.

In a nutshell, using those 3 parameters, we will be able to score and cluster clients, used later to launch the specific campaigns in order to maximize sales_net.

Usefulness

Using the rfm package, the dataset will be enriched by the addition of distict new features:

Features: -> [recency,frequency,monetary_value,r,f,m,rfm_score,segment]

Those new features may be useful both to predict sales, attending to the segment in which the client stands and to cluster clients.

Example of Implementation

In [20]:
r = RFM(sampled_train, customer_id='client_id', transaction_date='date_order', amount='quantity')

From the generated DataFrame, r.rfm_table, we see how the clients were scored according to the recency, frequency and monetary value they brought to the company.

Asides from that, attending to an standard classification, clients were clustered into distinct groups:

  • Champions -> {R: 4-5, F: 4-5, M: 4-5}
  • Promising -> {R: 4-5, F: 1-2, M: 3-4-5}
  • Loyal Accounts -> {R: 3-4-5, F: 3-4-5, M: 3-4-5}
  • Potential Loyalist -> {R: 3-4-5, F: 2-3, M: 2-3-4}
  • New Active Accounts -> {R: 5, F: 1, M: 1-2-3-4-5}
  • Low Spenders -> {R: 3-4-5, F: 1-2-3-4-5, M: 1-2}
  • Need Attention -> {R: 2-3, F: 1-2, M: 4-5}
  • About to Sleep-> {R: 2-3, F: 1-2, M: 1-2-3}
  • At Risk-> {R: 1-2, F: 1-2-3-4-5, M: 3-4-5}
  • Lost-> {R: 1-2, F: 1-2-3-4-5, M: 1-2}

+ Info: https://github.com/sonwanesuresh95/rfm/blob/main/rfm/rfm.py

Now, this classification is standard, which means is not specific and maybe not suitable for our company. A further study on the way our clients behave should be done to drill down the most suitable classification. However, at the moment, we will accept this standard as correct.

We can easily plot how clients are distributed according to the three parameters:

In [21]:
r.plot_rfm_histograms()

But a more useful visualization would be a Treemap, in which to see, proportionally, the percentage of each cluster in our dataset.

To implement it, we will use plotly.express.

In [22]:
labels = list(r.segment_table['segment'])
values = list(r.segment_table['no of customers'])
In [23]:
percentages = []
for value in values:
    percentages.append(int(value/sum(values)*100))
r.segment_table["perc"] = percentages
In [24]:
r.segment_table = r.segment_table.drop("no of customers", axis = 1)
In [25]:
r.segment_table
Out[25]:
segment perc
0 Champions 10
1 Loyal Accounts 16
2 Low Spenders 17
3 Potential Loyalist 4
4 Promising 6
5 New Active Accounts 1
6 Need Attention 5
7 About to Sleep 5
8 At Risk 18
9 Lost 12
In [26]:
fig = px.treemap(r.segment_table, path=r.segment_table.columns, values = percentages)
fig.show()

That's a better visualization!

We can clearly spot how the different cluster generated by the RFM analysis, are distributed.

We can also try to plot each instance in a 3d projection:

In [27]:
enc = OrdinalEncoder()
r.rfm_table[['segment']] = enc.fit_transform(r.rfm_table[['segment']])
scaler = MinMaxScaler()
r.rfm_table[['recency', 'frequency', 'monetary_value']] = scaler.fit_transform(r.rfm_table[['recency', 'frequency', 'monetary_value']])
r.rfm_table["client_id"] = r.rfm_table["client_id"].astype('int64')
sampled_train = sampled_train.merge(r.rfm_table, on='client_id', how='inner')
sampled_train = sampled_train.drop(['r','f','m','rfm_score'], axis = 1)
In [28]:
fig = px.scatter_3d(r.rfm_table, x='frequency', y='recency', z='monetary_value',
                    color='segment')
fig.show()

Just for the sake of experimenting, we can try setting a threshold for Churn: if a client belongs to segment 6 or higher, then it's Churn(either Lost, About to Sleep, At Risk).

In [29]:
r.rfm_table['Churn'] = r.rfm_table['segment'].apply(lambda x: 0 if x < 6 else 1)
In [30]:
fig = px.scatter_3d(r.rfm_table, x='frequency', y='recency', z='monetary_value',
                    color='Churn')
fig.show()

CLV Analysis¶

Let's move on to another classic analysis made for customer segmentation: CLV:

CLV is a technique used to predict the total value that a customer will bring to a business over their relationship. It's based on the value brought by the customer and the duraiton of the relationship with the company.

Let's see its implementation with the lifetime package:

In [31]:
sampled_train = pd.read_csv('./processed/sample_example.csv', index_col=[0])
In [32]:
plt.rcParams["figure.figsize"] = (20,10) 
lt_RFM = summary_data_from_transaction_data(sampled_train,
'client_id', 'date_order', 'sales_net', observation_period_end='2019-09-20')
In [33]:
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(lt_RFM['frequency'], lt_RFM['recency'], lt_RFM['T']);
In [34]:
plot_frequency_recency_matrix(bgf);
In [35]:
plot_probability_alive_matrix(bgf);
In [36]:
t = 3
lt_RFM['predicted_purchase'] = bgf.conditional_expected_number_of_purchases_up_to_time(
t, lt_RFM['frequency'], lt_RFM['recency'], lt_RFM['T']) 
lt_RFM.sort_values(by='predicted_purchase').head()
Out[36]:
frequency recency T monetary_value predicted_purchase
client_id
44899 0.0 0.0 725.0 0.0 0.00095
2024114 0.0 0.0 725.0 0.0 0.00095
1371668 0.0 0.0 725.0 0.0 0.00095
203944 0.0 0.0 725.0 0.0 0.00095
274520 0.0 0.0 725.0 0.0 0.00095
In [37]:
 plot_period_transactions(bgf, max_frequency=100).set_yscale('log')
In [38]:
plot_incremental_transactions(bgf, sampled_train, 'date_order', 'client_id', 513, 363, freq='D');
In [39]:
summary_with_money_value = lt_RFM
returning_customers_summary = summary_with_money_value[summary_with_money_value['frequency']>0] 
returning_customers_summary.head()
Out[39]:
frequency recency T monetary_value predicted_purchase
client_id
252 1.0 161.0 347.0 169.01135 0.004269
7121 1.0 20.0 219.0 231.23740 0.004968
7652 1.0 212.0 297.0 58.05660 0.004674
10157 1.0 78.0 494.0 185.45912 0.003348
11469 1.0 79.0 689.0 34.84040 0.002623
In [40]:
returning_customers_summary[['monetary_value', 'frequency']].corr()
Out[40]:
monetary_value frequency
monetary_value 1.000000 -0.005316
frequency -0.005316 1.000000
In [41]:
from lifetimes import GammaGammaFitter
ggf = GammaGammaFitter(penalizer_coef = 0.001)
ggf.fit(
    returning_customers_summary['frequency'],
    returning_customers_summary['monetary_value']
)
Out[41]:
<lifetimes.GammaGammaFitter: fitted with 1346 subjects, p: 3.08, q: 0.69, v: 6.88>
In [42]:
print(
"Expected conditional average revenue: %s, Average revenue: %s" % (
ggf.conditional_expected_average_profit(returning_customers_summary['frequency'],
                                        returning_customers_summary['monetary_value']).mean(), 
    summary_with_money_value[summary_with_money_value['frequency']>0]['monetary_value'].mean()));
Expected conditional average revenue: 197.9877928977734, Average revenue: 175.87519716410674
In [43]:
returning_customers_summary
Out[43]:
frequency recency T monetary_value predicted_purchase
client_id
252 1.0 161.0 347.0 169.011350 0.004269
7121 1.0 20.0 219.0 231.237400 0.004968
7652 1.0 212.0 297.0 58.056600 0.004674
10157 1.0 78.0 494.0 185.459120 0.003348
11469 1.0 79.0 689.0 34.840400 0.002623
... ... ... ... ... ...
2260903 1.0 318.0 354.0 110.142400 0.004407
2263434 1.0 226.0 291.0 76.217400 0.004734
2268557 1.0 474.0 548.0 30.879800 0.003550
2272758 8.0 555.0 634.0 122.753841 0.022653
2273574 3.0 290.0 378.0 31.899467 0.011826

1346 rows × 5 columns

In [44]:
bgf.fit(returning_customers_summary['frequency'], returning_customers_summary['recency'], 
        returning_customers_summary['T'])

print(ggf.customer_lifetime_value(bgf, 
                                  returning_customers_summary['frequency'], 
                                  returning_customers_summary['recency'], 
                                  returning_customers_summary['T'], 
                                  returning_customers_summary['monetary_value'],
                                  time=5, # months
                                  discount_rate=0.01).head(10))
client_id
252      35.971006
7121     47.139106
7652     18.677111
10157    15.547479
11469     1.572376
12251     2.557314
14147     2.887103
14188    51.875260
15024     0.636711
16720    35.133160
Name: clv, dtype: float64
In [45]:
plot_period_transactions(bgf, max_frequency=100).set_yscale('log')
In [46]:
plot_cumulative_transactions(bgf, sampled_train, 'date_order', 'client_id', 513, 363, freq='D');

Cluster Analysis¶

Hooray!

We have now a set of grouped customers according to our RFM and also according to CLV!

Nonetheless, these groups may overlap on each other, and may even be counterituitive to split.

For that reason, and for the sake of experimenting, let's see how K-Means would cluster our customers in comparison to our RFM analysis:

In [47]:
sampled_train = pd.read_csv('./processed/sample_example.csv', index_col=[0])
In [48]:
sampled_train = sampled_train.drop(['date_order'], axis = 1)
sampled_train = sampled_train.astype(float)
In [49]:
sampled_train = sampled_train.dropna()
In [50]:
sse = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters = k, max_iter=100000).fit(sampled_train)
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.show()

This means that, a 90% of the variance (less than a 10% of the Sum of Squares Error) in our RFM Analysis may be explained with just 2 clusters!

Let's apply it:

In [51]:
k = 2
In [52]:
kmeans = KMeans(n_clusters = k, random_state = SEED, max_iter=100000).fit(sampled_train)
sampled_train['kmeans'] = kmeans.predict(sampled_train)

We must always keep in mind that, even thought there is a clear overlap on this visualization, in different hyperplanes this might change. Let's see how this goes if plotted after PCA:

In [53]:
pca = PCA(k)
projected = pca.fit_transform(sampled_train)
scaler = MinMaxScaler()
projected = scaler.fit_transform(projected)
fig = plt.figure(figsize=(10,10))
plt.scatter(projected[:, 0], projected[:, 1],
            c=sampled_train['kmeans'], edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('rainbow', 10))
plt.xlabel('PC1')
plt.ylabel('PC2');

Anomaly Analysis¶

For the sake of experimenting, we can go the extra mile: sometimes, Churn can be an isolated and abnormal behaviour, that may be detected via anomaly detection.

For that reason, we may include the anomalies detected by an Isolation Forest model as a new feature.

let's see if we actually spot any:

In [54]:
sampled_train = pd.read_csv('./processed/sample_example.csv', index_col=[0])
In [55]:
sampled_train = sampled_train.drop(['date_order'], axis = 1)
In [56]:
model = IsolationForest(random_state=SEED)
model.fit(sampled_train)
pred = model.predict(sampled_train)
sampled_train['is_anomaly'] = pred
sampled_train['is_anomaly'] = sampled_train['is_anomaly'].replace(-1, 1).replace(1, 0)
In [57]:
sampled_train['is_anomaly'].value_counts()
Out[57]:
0    9259
Name: is_anomaly, dtype: int64

3.11. EDA Conclusions¶

  • Numerical:['sales_net','quantity']
  • Categorical:['product_id', 'client_id', 'order_channel', 'branch_id']
  • Dates: ['date_invoice', 'date_order']
  • Possible Outliers: ['sales_net']
  • Missing Values: ['date_invoice'] -> Drop instance
  • Erroneous Values:['sales_net'] -> Drop instance
  • One Hot Encoding: ['order_channel', 'segment']
  • Irrelevant Features: ['date_invoice', 'product_id', 'order_channel', 'branch_id']['date_invoice', 'product_id', 'order_channel', 'branch_id']
  • Feature Generation:
    • Numerical: ['data_order_year', 'data_order_month', 'data_order_dayofmonth', 'data_order_dayofweek', 'trend', 'segment', 'is_anomaly', 'cluster', 'ensemble']
  • Scaling: MinMaxScaler()

4. Data Preprocessing¶

So, we've seen a lot in this train set. We saw the amount of features, how they are distributed and the possible future combinations that might be useful.

Now it's time to take action, and implement the strategies we planned in the previous step!

Let's go on with the Data Cleaning checklist:

4.1. Missing Values¶

In [58]:
def drop_values(df, feature):
    df[feature] = df[feature].dropna()
    return df
In [59]:
train = drop_values(train, 'date_invoice')

4.2. Erroneous Values¶

In [60]:
def drop_erroneous(df, feature, threshold):
    return df[df[feature]>threshold]
In [61]:
train = drop_erroneous(train, 'sales_net', 0)

4.3. Encoding¶

In [62]:
def encoder_processor(df, feature):
    df = pd.get_dummies(df, columns = feature)
    return df
In [63]:
train = encoder_processor(train, ['order_channel'])

4.4. Irrelevant Features¶

In [64]:
def features_keeper(df, features):
    return df[features]
In [65]:
train = features_keeper(train, ['client_id','date_order','quantity','sales_net'])

4.5. Conventional Feature Generation | Extraction¶

In [66]:
def extract_dates(df):
    df['date_order'] =  pd.to_datetime(df['date_order'], format='%Y-%m-%d')
    df['data_order_year']=df['date_order'].dt.year
    df = encoder_processor(df, ['data_order_year'])
    df['data_order_quarter']=df['date_order'].dt.quarter
    df = encoder_processor(df, ['data_order_quarter'])
    return df
In [67]:
train = extract_dates(train)

4.6. Casting¶

In [68]:
def type_casting(df):
    df.quantity = df.quantity.astype(np.float32)
    df.sales_net = df.sales_net.astype(np.float32)
    df.sales_net=df.sales_net.astype(np.float32)
    df.data_order_year_2017=df.data_order_year_2017.astype(np.uint8)
    df.data_order_year_2018=df.data_order_year_2018.astype(np.uint8)
    df.data_order_year_2019=df.data_order_year_2019.astype(np.uint8)
    df.data_order_quarter_1=df.data_order_quarter_1.astype(np.uint8)
    df.data_order_quarter_2=df.data_order_quarter_2.astype(np.uint8)
    df.data_order_quarter_3=df.data_order_quarter_3.astype(np.uint8)
    df.data_order_quarter_4=df.data_order_quarter_4.astype(np.uint8)
    return df
In [69]:
train = type_casting(train)

4.7. Scaling¶

In [70]:
numerical = ['quantity', 'sales_net']
In [71]:
def numerical_scaler(df, numerical):
    '''
    Scales the numerical features, with an StandardScaler()/MinMaxScaler()
    '''
    scaler = MinMaxScaler()
    df[numerical] = scaler.fit_transform(df[numerical])
    return df
In [72]:
train = numerical_scaler(train, numerical) 

4.8. Advanced Feature Generation¶

In [73]:
def trend_processor(df, periods=1):
    df = df.sort_values(['client_id', 'date_order'])
    df['prev_order'] = df['date_order'].shift(periods)
    df['delay'] = df['date_order'] - df['prev_order']
    df['avg_delay'] = df.groupby('client_id')['delay'].transform('mean')
    df['trend_pred'] = df['delay'].gt(df['avg_delay']).astype(int)
    df['trend_pred'] = df.groupby('client_id')['trend_pred'].transform(lambda x: 1 if (x == 1).any() else 0)
    num_orders = df.groupby('client_id').size()
    last_order_date = df.groupby('client_id')['date_order'].last()
    if ((num_orders < periods) & last_order_date.dt.year.eq(2019) & last_order_date.dt.quarter.eq(4)).any():
        df['trend_pred']=0
    df = df.drop(['prev_order', 'delay', 'avg_delay'], axis = 1)
    df['trend_pred'] = df['trend_pred'].astype(np.uint8)
    return df
In [74]:
train = type_casting(train)
train = trend_processor(train, 1)
In [75]:
def rfm_processor(df):
    df.quantity = df.quantity.astype(np.int64)
    r = RFM(df, customer_id='client_id', transaction_date='date_order', amount='quantity') 
    r.rfm_table["client_id"] = r.rfm_table["client_id"].astype('int64')
    df = df.merge(r.rfm_table, on='client_id', how='inner')
    enc = OrdinalEncoder()
    df[['segment']] = enc.fit_transform(df[['segment']])
    df = df.drop(['r', 'f', 'm', 'rfm_score'], axis = 1)
    df = numerical_scaler(df, ['recency', 'frequency', 'monetary_value'])
    df['rfm_pred'] = df['segment'].apply(lambda x: 0 if x < 6 else 1)
    df = df.drop('segment', axis = 1)
    return df
In [76]:
train = type_casting(train)
train = rfm_processor(train)
In [77]:
def cluster_processor(df, k):
    kmeans = KMeans(n_clusters = k, random_state = SEED, max_iter=1000).fit(df)
    df['kmeans_pred'] = kmeans.predict(df)
    return df
In [78]:
train['sales_net'] = pd.to_numeric(train['sales_net'], errors='coerce')
train = type_casting(train)
train = train.drop(['date_order'], axis = 1)
train = train.dropna()
train = cluster_processor(train, 2)
In [79]:
def anomaly_processor(df):
    model = IsolationForest(random_state=SEED)
    model.fit(df)
    pred = model.predict(df)
    df['anomaly_pred'] = pred
    df['anomaly_pred'] = df['anomaly_pred'].replace(-1, 1).replace(1, 0)
    return df
In [80]:
train = anomaly_processor(train)
In [81]:
def churn_imputer(df):
    df['majority'] = df[['trend_pred', 'rfm_pred', 'kmeans_pred', 'anomaly_pred']].sum(axis=1)
    df['Churn'] = df['majority'].apply(lambda x: 1 if x >= 2 else 0)
    df = df.drop(['trend_pred', 'rfm_pred', 'kmeans_pred', 'anomaly_pred', 'majority'], axis = 1)
    return df
In [82]:
train['Churn'] = 0
train = churn_imputer(train)
In [99]:
fig = px.scatter_3d(train, x='frequency', y='recency', z='monetary_value',
                    color='Churn')
fig.show()
In [84]:
train.Churn.value_counts()
Out[84]:
0    830793
1     11258
Name: Churn, dtype: int64

5. Model Training¶

After exploring and processing our train set, it's time to set a first benchmark model. The only intention of this model is to ground up some base from which we can contruct and improve our model.

As we mentioned, after the preprocessing stage, our problem is a Binary Classification. Therefore, the most basic starting point is the Logistic Regressionmodel.

However, this simple model, as it usually happens with too simple models, often ends up underfitting. To create more ellaborated and complex models, we might be interested in evaluating others such as:

  • Decision Trees
  • Random Forest and variations

For each of them, we will measure it's performance, with the selected metric, and always attending to a cross-validation fold approach. This is fundamental to have a more realistic score from the training phase.

In [85]:
X = train.drop(['client_id', 'Churn'], axis = 1)
y = train['Churn']

5.1. Logistic Regression¶

Similarly to Linear Regression, Logistic Regression is based on computing a weighted sum of input features, plus a final bias term. Nonetheless, over that result we apply the logistic function, creating a sigmoid function which clearly separates two classes: 0 and 1.

We will use a variation from the traditional Logistic Regression: LogisticRegressionCV . With it, we will be able to apply Cross-Validation to our model.

Cross Validation creates virtual splits on our trainset during runtime, as many as we set. Then, it uses, in each iteration, each of the folds as fixed validation sets. Finally, it returns back an averaged result, giving a more refined and realistic result.

We could also apply regularization to our model, but it may not be even necessary, as we do not expect it to overfit our set (it's a simple model).

Let's create our first model!

In [86]:
model = LogisticRegressionCV(max_iter =5000, n_jobs = -1, cv = 5)
model.fit(X, y)
pred = model.predict(X)

5.2. Decision Trees¶

This is a very different model from Logistic Regression. Of course, this model is much more advanced than the previous, which means we should pay attention to overfitting. This model is also the core of another model we will try later, Random Forest.

As it name indicates, its based on creating decision trees, with variable depth, to drill down the possible outputs attending to different features.

In [87]:
model = DecisionTreeClassifier(max_depth = 10, min_samples_split = 1000)
model.fit(X, y)
pred = model.predict(X)

5.3. Random Forest¶

A Random Forest is yet another Decision Tree, and part of the Ensemble family, trained via the bagging method.

However, to suceed, Random Forest searches for the best feature among a random subset of features.

In [88]:
model = RandomForestClassifier(max_depth=50, 
                                max_features='sqrt',
                                min_samples_leaf=40,
                                min_samples_split=30,
                                n_estimators=1000, 
                                random_state = SEED)
model.fit(X, y)
pred = model.predict(X)

Now, let's see how it performs:

In [89]:
mean_absolute_percentage_error(y, pred)
Out[89]:
101619014667.81773

6. Model Validation¶

Let's analyze how well our model does with unseen data, so we can hacve a clearer idea on how well it really performs. This can be achieved through an iterative process, Cross Validation.

In it:

  • The model is trained within a temporary section of the training set
  • Then, we compute the error between the model's predictions and the real data (the unseen portion)
  • Finally, we iterate again, changing the partitions and avergaing the resulting error from all the iterations.

Let's go!

In [90]:
print(-cross_val_score(model, X, y, scoring = 'neg_mean_absolute_percentage_error', cv=5).mean())
1422667482023.8906

The good thing about CV is that, indeed, we have the ability to use the information of the tests to overfit meaning we can use it to guide our decisions towards tuning the non-traineable parameters (a.k.a. hyperparameters).

This tnuning can be performed both with GridSearchCV(slower to compute, as it checks all possible combinations) and RandomizedGridSearchCV which is faster, but doesn't converge.

A combination of both is usually the best, using first RandomizedGridSearch to guide us in general, and then GridSearchCV to find the best possible hyperparameter.

In [91]:
pipeline = Pipeline(steps=[('model', RandomForestClassifier())]) 
param_grid = {
     'model__n_estimators': [800, 1000],
     'model__criterion': ['gini', 'entropy'],
     'model__max_depth': [30, 40, 50, None],
     'model__min_samples_split': [5, 10, 20, 30],
     'model__min_samples_leaf': [30, 40, 50, 60],
     'model__max_features': ['auto', 'sqrt', 'log2']
    }

estimator = RandomizedSearchCV(pipeline, param_grid, cv=5, n_iter=20, n_jobs=-1)
estimator.fit(X, y)
estimator.best_estimator_
Out[91]:
Pipeline(steps=[('model',
                 RandomForestClassifier(criterion='entropy', max_depth=30,
                                        max_features='log2',
                                        min_samples_leaf=60,
                                        min_samples_split=30,
                                        n_estimators=800))])

7. Model Testing¶

Its time for the truth!

In here we apply the same transformations as we did to our train, to out test, and check the error associated to our final models' predictions.

In [92]:
test
Out[92]:
date_order date_invoice product_id client_id sales_net quantity order_channel branch_id
12229761 2018-02-13 2018-02-14 1200535 772802 15.870000 3 at the store 7203
41656133 2019-01-18 2019-01-18 2153450 51349 49.781200 7 at the store 6654
31711047 2018-09-27 2018-09-27 2118769 54294 37.563600 3 at the store 447
59147903 2019-06-12 2019-08-09 387921 1587101 -61.244400 3 by phone 3054
18757864 2018-04-23 2018-04-23 1854381 1604281 18.713904 201 at the store 9254
... ... ... ... ... ... ... ... ...
14357755 2018-03-07 2018-03-07 1099086 853417 16.863600 3 at the store 9794
16890857 2018-04-02 2018-04-02 3218809 2136393 0.000000 3 at the store 5255
30993627 2018-09-07 2018-09-17 2696726 753504 102.090560 11 by phone 9280
61662690 2019-09-03 2019-09-03 1569835 371158 11.288400 3 at the store 298
51714653 2019-05-23 2019-05-23 2732245 457697 117.410400 3 by phone 4781

200000 rows × 8 columns

In [93]:
test = data_optimize(test)
test = drop_values(test, 'date_invoice')
test = drop_erroneous(test, 'sales_net', 0)
test = encoder_processor(test, ['order_channel'])
test = features_keeper(test, ['client_id','date_order','quantity','sales_net'])
test = extract_dates(test)
test = type_casting(test)
test = numerical_scaler(test, numerical) 
test = trend_processor(test, 1)
test = rfm_processor(test)
test = test.drop(['date_order'], axis = 1)
test['sales_net'] = pd.to_numeric(test['sales_net'], errors='coerce')
test = type_casting(test)
test = test.dropna()
test = cluster_processor(test, 2)
test = anomaly_processor(test)
test['Churn'] = 0
test = churn_imputer(test)
In [94]:
X = test.drop(['client_id', 'Churn'], axis = 1)
y = test['Churn']
In [95]:
pred = model.predict(X)
In [96]:
mean_absolute_percentage_error(y, pred)
Out[96]:
1168808152807.9492
In [97]:
submission = pd.DataFrame(data = pred, columns = ['Churn'])
submission['client_id']=test['client_id']
submission.set_index('client_id', inplace = True)
submission.to_csv('./results/prediction.csv')

8. Model Deployment¶

Using pickle we can export our model to a binary file, which may be later implemented as part of an API.

In [98]:
pickle.dump(model, open('model.pkl', 'wb'))

9. References¶

[1] Blattberg, R. C., Kim, P.-do, & Neslin, S. A. (2008). RFM Analysis. In Database Marketing. Springer.

[2] Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (3rd ed.). O’Reilly Media.